Float Point
Volume Number: 2
Issue Number: 7
Column Tag: Threaded Code
A Custom Floating Point Package 
By Jörg Langowski, EMBL, c/o I.L.L., Grenoble, Cedex, France,
Editorial Board
"12 Kflops - Forth goes inSANE"
You might have noticed my frequent complaints about the slowness of the Apple
80-bit SANE routines and the non-availability of a reasonably fast 32 bit floating point
package in Forth systems (and most other programming languages, for that matter,
Fortran being the only exception).
The scaling that is necessary if you want to do rapid calculations in Forth (using
integer arithmetic) is a nuisance, and it is almost impossible to implement general
purpose mathematical routines without floating point. Since I needed to run this
machine as fast as possible, I set out to write a set of single precision real arithmetic
routines.
IEEE 32-bit real numbers
The number format that we are dealing with is the IEEE standard and defined as
follows:
bit 31 (highest bit): sign (0 = positive)
bits 30-23: exponent (offset = 127)
bits 22-0: fraction part (23 stored bits; the 24th bit is implicit and always = 1)
That is, a real number has the form
(+/-)1.xxxxxx * 2^yyy,
where the exponent yyy is obtained by subtracting 127 from the stored exponent; only
the fraction part of the mantissa is stored because the integer part is always 1 by
definition (otherwise, the fraction part will have to be shifted left and the exponent
decreased until the integer part is =1; this is called normalization).
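As an illustration of this layout (a Python sketch, not part of the original 68000 package), the three fields can be pulled straight out of a number's bit pattern; `struct` is used here to obtain the single-precision representation:

```python
import struct

def unpack_ieee32(x):
    """Split a float into its IEEE single-precision fields:
    sign bit, biased 8-bit exponent, 23-bit stored fraction."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # offset-127 exponent
    fraction = bits & 0x7FFFFF       # 23 stored bits, leading 1 implicit
    return sign, exponent, fraction

print(unpack_ieee32(1.0))   # (0, 127, 0): 1.0 = +1.0 * 2^0
print(unpack_ieee32(-6.5))  # (1, 129, 5242880): -6.5 = -1.625 * 2^2
```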
The highest exponent, 255, is reserved for special cases: with zero mantissa it
designates positive/negative infinity, and with non-zero mantissa it is a 'no-number'
(Not a Number or NaN). This latter case is used to mark the results of undefined or
illegal operations, such as 0/0, infinity - infinity, square root of a negative number, etc. The type of 'non-numberness' is indicated by the value of the mantissa.
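The special-case rules can be sketched the same way; this hypothetical `classify` helper only illustrates the bit patterns and is not code from the listing:

```python
def classify(bits):
    """Interpret a 32-bit IEEE pattern: exponent 255 with a zero
    fraction is +/- infinity; with a non-zero fraction it is a NaN."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 255:
        return "nan" if fraction else "inf"
    return "number"

print(classify(0x7F800000))  # inf  (+infinity)
print(classify(0x7FC00000))  # nan  (a quiet-NaN pattern)
print(classify(0x3F800000))  # number  (1.0)
```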
Floating point arithmetic is thoroughly treated in D. Knuth's 'The Art of Computer Programming', Vol. 2. With this excellent book as a guide, writing the single precision routines becomes a manageable job. We'll first consider addition.
Floating point addition and subtraction
Adding two floating point numbers that have the same exponent is trivial: we
simply add the fraction parts. Unfortunately, in real life numbers often have different
exponents. Two such numbers are added by shifting the fraction part of the smaller
one right by the difference in exponents. Then the fractions may be added; the exponent
of the result is that of the larger of the two input numbers.
This addition may generate a fraction overflow, when the new fraction part
carries into the 25th bit. In that case, the fraction has to be shifted right and the
exponent increased by one. This in turn might generate an exponent overflow, in which case we
have to set the result to infinity.
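The align-and-add step, including the fraction overflow fix-up, can be sketched in Python on (exponent, fraction) pairs. This assumes both operands are positive and already normalized, and the names are my own, not taken from Listing 1:

```python
def add_fractions(e1, f1, e2, f2):
    """Add two positive normalized (exponent, 24-bit fraction) pairs.
    The smaller operand's fraction is shifted right by the exponent
    difference before the integer add; the result keeps the larger
    exponent. A carry into the 25th bit (fraction overflow) is fixed
    by one right shift and an exponent increment."""
    if e1 < e2:                   # make operand 1 the larger one
        e1, f1, e2, f2 = e2, f2, e1, f1
    f2 >>= (e1 - e2)              # align the smaller operand
    f = f1 + f2
    if f >= 1 << 24:              # fraction overflow into bit 24
        f >>= 1
        e1 += 1
    return e1, f

# 1.0 + 1.0: both are (127, 0x800000); the sum overflows bit 24,
# giving (128, 0x800000), i.e. 2.0.
print(add_fractions(127, 0x800000, 127, 0x800000))
```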
When adding two numbers of opposite sign, the fractions have to be
subtracted, and the resulting fraction may fall well short of full 24-bit
precision. We have to normalize the result, shifting the fraction left until the 24th bit
is equal to one again, decreasing the exponent as we go. We assume that the input to our
routines always consists of normalized floating point numbers; the routines will always
return normalized results as well.
Normalizing after subtraction may result in an exponent underflow. We could
simply stop normalization and keep an unnormalized fraction part for very small
numbers (in fact, the IEEE standard provides for this). I have chosen here to simply set
the result equal to zero, since the algorithms become more complicated if one allows
unnormalized small numbers.
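The normalization loop, with the flush-to-zero choice described above, might look like this in Python (again an illustrative sketch, not the Mach1 code):

```python
def normalize(e, f):
    """Renormalize after a subtraction: shift the fraction left until
    the 24th bit (bit 23) is set again, decrementing the exponent.
    On exponent underflow these routines flush the result to zero
    rather than produce unnormalized small numbers."""
    if f == 0:
        return 0, 0               # exact zero
    while f < 1 << 23:
        f <<= 1
        e -= 1
        if e <= 0:                # exponent underflow: flush to zero
            return 0, 0
    return e, f

# A fraction one bit short of normalized: one left shift fixes it.
print(normalize(127, 1 << 22))    # (126, 8388608)
```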
After normalization, the fraction part is rounded. This is necessary because we
actually calculate the result with a higher precision than 24 bits; in general, we will
have a 'raw' 32-bit result. The rounding algorithm implemented here goes as follows
(bit 0 is the least significant bit of the 24 bit mantissa, bits (-1) and (-2) the bits
below it):
- if bit (-1) is equal to zero, don't change anything.
- if bits (-1) and (-2) are equal to one, increment fraction by one.
- if bit (-1) = 1 and bit (-2) = 0, increment the fraction only if bit 0 = 1 (round to the
nearest even number).
Since rounding includes incrementing, we might generate a new fraction
overflow, so we have to check for that again after rounding.
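The three rounding rules plus the post-rounding overflow check can be condensed into one sketch. Here the 'raw' result is modeled as a 26-bit value, the 24-bit fraction with bits (-1) and (-2) below it; the helper name is hypothetical:

```python
def round_fraction(e, raw):
    """Round a raw 26-bit result (24-bit fraction plus two guard bits)
    to 24 bits: bit (-1) clear -> truncate; bits (-1) and (-2) both
    set -> round up; bit (-1) set, bit (-2) clear -> round up only if
    bit 0 is set (round to even). A carry out of the 24th bit is a
    fresh fraction overflow, fixed by one more right shift."""
    f = raw >> 2
    b1 = (raw >> 1) & 1           # bit (-1)
    b2 = raw & 1                  # bit (-2)
    if b1 and (b2 or (f & 1)):    # round up, ties go to even
        f += 1
        if f >= 1 << 24:          # fraction overflow after rounding
            f >>= 1
            e += 1
    return e, f

# Tie case (bits -1 -2 = 10): fraction 2 stays, fraction 3 rounds to 4.
print(round_fraction(127, (2 << 2) | 0b10))   # (127, 2)
print(round_fraction(127, (3 << 2) | 0b10))   # (127, 4)
```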
Getting the correct sign for the result is not too complicated, either; if both
numbers have the same sign, the result will have that sign. If the numbers are of
opposite sign, the fractions have to be subtracted instead of added. If because of the
subtraction the result fraction becomes negative, the fraction and the sign of the result
have to be inverted.
The result of the subtraction then is normalized and rounded as described before.
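The sign logic can be sketched as follows, assuming the exponents have already been aligned so that only fractions and signs matter (the names are illustrative):

```python
def signed_add(s1, f1, s2, f2):
    """Combine signs as described in the text: with equal signs, add
    the fractions and keep the common sign; with opposite signs,
    subtract, and if the difference goes negative, negate it and
    take the other operand's sign."""
    if s1 == s2:
        return s1, f1 + f2
    d = f1 - f2
    if d < 0:                     # result fraction went negative:
        return s2, -d             # invert fraction and flip the sign
    return s1, d

# (+3) + (-5): the difference is negative, so the result is -2.
print(signed_add(0, 3, 1, 5))    # (1, 2)
```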
The algorithm is implemented in Listing 1 in standard 68000 assembly code of
the Mach1 variety. My code assumes that registers D0 to D6 may be used freely, which
is true for Mach1. If your Forth system (or any other system where you wish to
implement this code) uses any of those registers, you will have to push them on the
stack before you enter and pull them off again on exit. Or change register assignments.
The Mach1 data stack is maintained by A6, so the operands are pulled off the A6
stack on entry and the sum pushed there on exit. This might also be a place where
changes would be necessary going to a different system. Otherwise, you can take the code
as it is. In MacForth, you will have to translate it into reverse Polish notation. NEON
provides a standard assembler.